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To perform multiple regression, the least squares estimator is commonly used. 
However, this estimator is not robust to outliers. Therefore, robust meth¬ 
ods such as S-estimation have been proposed. These estimators flag any 
observation with a large residual as an outlier and downweight it in the fur¬ 
ther procedure. However, a large residual may be caused by an outlier in 
only one single predictor variable, and downweighting the complete obser¬ 
vation results in a loss of information. Therefore, we propose the shooting 
S-estimator, a regression estimator that is especially designed for situations 
where a large number of observations suffer from contamination in a small 
number of predictor variables. The shooting S-estimator combines the ideas 
of the coordinate descent algorithm with simple S-regression, which makes 
it robust against componentwise contamination, at the cost of failing the re¬ 
gression equivariance property. 

Keywords: cellwise outliers; componentwise contamination; shooting al¬ 
gorithm; coordinate descent algorithm; regression S-estimation 


1. Introduction 

In robust statistics it is generally assumed that the majority of observations is totally 
free of contamination. Any observation that deviates from the model is as a whole 
flagged as an outlier, even if only one component of the observation is contaminated. In 
case only a small number of predictor variables cause the deviation from the model, a 
lot of information is lost through downweighting the whole observation. Therefore, it 
seems more appropriate to not consider whole observations as outliers but only those 
components that really deviate from the model. This is especially useful if the major¬ 
ity of observations is contaminated in only a small number of variables. Imagine, for 
example, a regression setting where in every observation one single predictor variable is 
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contaminated. Here the usual robust methods break down, as there is not one single 
clean observation. But the majority of the cells of the design matrix is still clean and 
thus the majority of the data is still clean. In this setting, it is more suitable to use 
techniques developed for cellwise contamination (componentwise contamination) rather 
t han those developed f or rowwise contamination. 

Alaallaf et al. 2009l | extend the rowwise contamination model to also cover cellwise 


contamination. They define the influence function and the breakdown point in this set¬ 
ting and derive them for some multivariate location estimators, showin g that these can¬ 


not co pe with cellwise contamination. For principal component analysis. IVan Aelst et al. 


2ninl| develop a method based on pairwise correlation that can deal with cellwise 
contamination. The same authors p ropose versions o f the Stahel-Donoho estimator 


based on Huber i zed o utlyingness [see I Van Aelst et al.l . l2012l | and cellwise weights 


see 


Van Aelst et al.l . 12011 ] 


In this paper we derive a regression estimator, called the shooting S-estimator, that 
can cope with cellwise contaminatio n. It combines the id eas of the coordinate descent 
algorithm (’shooting a lgorithm’) [see Friedman et al. . 200?! . Fu . 1998l | with simple regres¬ 


sion S-estimation [see iMaronna et al.l . l2006l |. In Section [21 we introduce the estimator. 


An algorithm is proposed in Section[3l We show simulation results in Section 0] where we 
compare the shooting S-estimator to the least squares estimator and the robust S- and 
MM-estimators. Real data examples are presented in Section [5] and Section [6] concludes. 


2. Motivation 


Our shooting S-estimator uses the ide a of t he co ordinate descent algorithm [see lFriedman et al.l . 
200^. also called shootine aleorithm 1998]. Orig inally, this method performs, vari¬ 
able by variable, simple lasso regression. iTsenei 200ll ] showed that by iteratively looping 
through all variables, it converges to the lasso estimate for a ny starting va l ue. H owever, 


it is well known that the lasso estimate is not robust [see e.g. lAlfons et al.l . l2013l |. In the 


shooting S-estimator, we a chieve robustness by r eplacing the lasso estimation with un 


penalized S-estimation [see iMaronna et al.l . l2006l | . In contrast to ordinary S-regression, 


the coordinate-wise approach of the coordinate descent algorithm allows us to weight all 
components of an observation differently. 

The lasso estimate is dehned as 


f^Lasso — 


1 


,_,asso = arg mm 
/3eKp 


n 




i=l 


j=i 




fty + 2AVl/’: 

i=i 


j I- 


In the coordinate descent algorithm, to update the estimate of the lasso coefficient j3j 
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(j = 1,... ,p), all other coefficients are kept fixed at pk {k ^ j) 


Lasso = argmin - - V'xjfc/^fc) - Xijl5jf + 2A Y] |/3fc| + 2\\j3j\ 

1 ” 

= argmin-Y]((yi - Y]xjfc/3fc) - Xij^jf + 2A|/3j|. (1) 

This can be seen as simple lasso regression where the new response 

= yi-'^XikPk, i = (2) 

k^j 


is regressed on Xij, for a fixed value of j. 

For the shooting S-estimator, we want to make sure that the new response y 1 , to be 
defined below, is not influenced by outliers in the cells Xik- Therefore, we first define 
regression weights 


Wik = w{ 



- Xikj3k\ ^ 


(3) 


where the argument of the weighting function w{ ) is the residual of regressing on 
Xik, scaled by a robust residual scale ak- Thus, Wik determines the ‘outlyingness’ of the 
cell Xik in the regression on Xik- The weighting function should be non-increasing 
on the positive numbers and take values in the interval [0,1]. Our preferred option - 
for reasons of simplicity - is hard rejection, where w{r) = 1 if r < c and 0 otherwise. 
Choosing the cut-off value c = 3, less than 0.3% of clean observations are expected to be 
flagged as outliers in the regression model with normal errors. Of course, other choices 
for the weight function are possible. 

The new response is defined as 



= 2/i - ^ Xiki^k with Xik 


'^ik^ik “ 1 “ (1 "^ik^^ik 


(4) 


The difference with ([2]) is that in the computation of the new response the values Xik are 

replaced by a convex combination Xik of the observed value Xik and of a ‘corrected’ value 
(k) 

Xik. As we kn ow y\ and /3fc, this ‘corrected’ value Xik is computed through calibration 


Brownl . Il982l |: 


^ik — 


-(fc) 


(5) 


(To avoid computational problems, we set Xik = 0 in case \$k\ is small.) The Xik can 
be interpreted as a cleaned version of the cell value Xik in the design matrix. If an 
observation is flagged as an outlier and gets a zero weight, the Xik equals the ‘corrected’ 
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value Xik- If an observation is declared as clean and gets a weight of one, the cleaned 
version equals the observed value. Note that Xik and Wik depend on /3fc, for k ^ j. 

To compute the regression estimate f3j , we use instead of the lasso as in ([T|) , the robust 
unpenalized simple S-regression estimator. This leads us to the shooting S-estimator, 
which is defined variablewise conditional on knowing the other estimates f3k with k ^ j, 

13j = argmin d'j(/3) ( 6 ) 

/3eK 


with (tj(/3) defined as solution s of the equation 


1 

n 



i=l 


= 5 . 


(7) 


Hence, d'j{/3j) is an M-estimator of scale computed from the residuals. Here 6 equals 
the expected value of the p-function at the normal distribution, i.e. S = E[p(Z)] with 
Z ~ Af(0, 1). It is chosen such that the breakdown point of the estimator is not too low, 
while its efficiency is high eno ugh. A higher value of d impl ies a higher breakdown point, 
but a lower efficiency [see e.g. iRousseeuw and Lerovl . 119871 . Chapter 3.4]. 

As a p-function we will use either Tukey’s biweight 


PBi(z) = 
or the skipped Huber 

Pskniz) = 


-(!-( 


if |z| 

< ksi 



if |z| 

> ksi 

fh’ 

if [zj < 

kskH 


1 ^skH 

1 2 

if zj > 

kskH- 



( 8 ) 

(9) 


These two p-functions are quite different in nature. The skipped Huber loss is quadratic 
in a central region [—kskH-, ksku] and constant outside this interval. Thus, skipped Huber 
is a skipped version of the quadratic loss. In contrast, the biweight loss is designed to 
be smooth while still boundi ng the effect of extrem e values. Apart from those two loss 


functions any p-function [see iMaronna et al.l . 120061 . p31, Def 2.1] could be used as well. 


The shooting S-estimator fulfills some natural equivariance properties. Assume a re¬ 
gression model with intercept. The estimator is computed using the coordinate descent 
algorithm with starting value described in Section [3l If a constant a is added to an 
explanatory variable, the corresponding estimate of the slope coefficient f3j stays un¬ 
changed, while the intercept shifts by aj3j. If a constant o is added to the response, none 
of the estimated coefficients changes and the intercept shifts by a. These properties can 
be shown using Equations (I3D , (HD , US]) and (HD , as well as the properties of the proposed 
initial estimator. If a multiple 7 of an explanatory variable is added to the response, we 
would like the corresponding slope coefficient to become j + I3j. This type of regression 
equivariance is fulhlled if the starting value has this property. Since the proposed start¬ 
ing value uses Huberized values for the predictor variables, this property does not fully 
hold, although one could say that for the converged estimator it ‘practically’ holds. 


4 










3. Algorithm 


To compute the shooting S-estimate described in Section O we use an iterative procedure 
similar to the coordinate descent algorithm. We first describe the iteration steps, and 
afterwards the determination of initial values (see Algorithm [1] for details). We assume 
that the model contains an intercept, denoted by a. 


Loop. In each step of the coordinate descent loop (with fixed j), we calculate the 
by dH) and dS]) and then compute the simple regression S-estimate of the on 
To do t his, we use the i terati vely reweighted least squares (IRLS) algorithm 


,U) 

H 

the X 

recommended bv iMaronna et al.l 2006l |. It consists of another iterative algorithm. In 


each iteration, called an I-step, a weighted least squares estimate of Pj is calculated and 
subsequently, a new value of the M-estimator of scale is computed by searching 

a hxed point of a recursive version of dZl)) f{s) = ^/{n6) ~ Xij$j)/s)s = s. 

Although convergence of the coordinate descent loop is not assured, we have observed 
it empirically in all our simulations studies. 


Initial values. We hrst Huberize the predictor values, and get ‘approximately clean’ 
predictors x^j. Then we use the MM-esti mator to get initial coeffi cients with the 
linear quadratic quadratic (Iqq) p-function Roller and Stahel l201ll | and tuning constants 
set for 50% breakdown point and 95% efficiency. 


Algorithm [T] gives the details. The code of the algorithm is available on the homepage 
of the hrst author. 


4. Simulations 


To evaluate the shooting S-estimator, we compare it to the classi cal least squares estim¬ 


ator (LS), the ordinary S-estimator and the MM-estimator [see IMaronna et al.l . 120061 1 . 


The shooting S-estimator is computed as in Algorithm [T] once with the biweight p- 
function ([5]) and once with the skipped Huber p-function Q. We choose ksi = 3.420 
and kskH = 2.177. This corresponds to a breakdown point of 20% in the simple re¬ 
gressions. Our choice seems to be a good trade-off between robustness and efficiency. 
In practice, the breakdown point needs to be increased if the data at hand is more 
severely contaminated than in this simulation setting. For the computation of the or¬ 
dinary S-estimate, we use the biweight loss function and set again ksi = 3.420. The 
MM-estimator is computed with the standard settings of 50% breakdown point and an 
efficiency of 95% at the normal model, using the biweight loss function. We stick here to 
the high breakdown point of 50%, as MM can achieve high efficiency and a high break¬ 
down point simultaneously. Thus, lowering the breakdown point would not increase the 
efficiency of the MM-estimator. 

For the simulation setup we take n = 100 and p = 15. The regression coefficients (3 are 
taken equally spaced over the interval [0,1], i.e. (Ij = j/p for j = 1,... ,p. The predictors 
Xj and errors e* are independent and identically normally distributed with mean 0 for 
i = 1,..., n. We choose two different sampling schemes, one with uncorrelated and one 
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Algorithm 1 Computation of the shooting S-estimate for a regression model 
with constant term 


# Initialization 

• L •.= 0 # Number of steps in coordinate descent loop 

• = max(mediani(a;ij) — 2 MADi(a;ij), min(a;ij, mediani(a;ij) + 2 MADi(a;ij))) 

-'( 0 ) 

• Compute the slopes /3 , the intercept a and the residual scale s from the MM-regression of 

Hi on the Huberized predictors using the Iqq p-function 

• a A := a, = 1,... ,p 

( 0 ) -- • , 

• s) ' s, j ^l,...,p 

# Coordinate descent loop 
o L := L + I 

<> For j = 1,... ,p # Index of the variable used in regression step 


# Regression step 

yO) fWo(L)_Y^ ~(L- 1 ) Z(L-l) 

* idi • r^k 2^k>j '^ik ^k ’ 

. r := 0 # Number of I-steps 
. res\^’°^ ~ - mediani(j/p' - 


i = 1,... ,n 


p'{res[^’°fs\^ ^^)/{resf^’°'’/sf 


i = 1,..., n 
i = 1,..., n 


# I-steps 
o r := r + 1 

o Compute the slope and the intercept from the 

weighted least squares regression of on Xij with 
weights #j is fixed 


(L.r) ~(j) • 1 

o res] ^ := — XijPj “ ^ 2 = 1,..., n 

o i ~ 0 # Number of M-steps to compute scale 
(mediaui • 1.4826 if r = 1 

O so = <^ ^L,r-l) .f ^ 1 

[ s} if r > 1 

# M-step 
A £:=£-\-l 


A sr:=^i^ELiP(^) 

A Repeat M-step until | ^ — 1| < ei = 10“ 


o ;= 


o ■— p'(res-^’’'Vs^^’’’^)/(res-^’’'Vs^^’'’^), 

o Repeat I-step until maxi Ires^^’^^ — res\^’^~^\ < £2 
# £2 = 10 -®MADi Vi 


i = 1...., n 


. /3f > := /3f 
. af := df 
. sf^ := 

. res^P := resf"'^'' i = 1,..., n 
/rs r frlR _ A,W\/aW ; 


(i) J iVi - « ■ l/P) if 1/3, 1 > £3 


i = 1,...,n 


I median! Xij otherwise 

# £3 = 10-^(MADi yi)/(MADi Xij) 

• ■=w{res\^fs^p) i = l,...,n 

~(L) (L) , /, (L)^~(L) . , 

■ xlj := wlj 'xij -b (1 - wh 

D 

o # End for-loop 

o Repeat coordinate descent loop until Ej=i |Sj'^^ — < £4 

# £4 = 10~^ MADi yi 

• Pj ■■= /3f ^ 

• d := mediani(pi - E?=i Pf'’) 




with correlated predictors. For the first one, we use the identity matrix as a covariance 
matrix for the predictors. The error variance is cr^ = 0.5^. In the correlated setting 
we choose the predictor covariance matrix E with Sjj = and the error variance 

= 0.81^. By this the signal-to-noise ratic0 is the same in both settings. The response 
variable is then created as yi = x'/3 + for i = 1,..., n. 

To every generated data set, we add 1%, 2%, 5% and 10% of cellwise contamina¬ 
tion. The cells Xij that we contaminate are chosen randomly from the design matrix 
X. So every cell of our data set is equally likely to be contaminated. Three different 
contamination settings are used: a dense cluster ~ AA(50,1), scattered outliers 

x?®”* ~ A/’(0,100^) and a wide cluster ~ AA(50,10^). We only contaminate the 
x-values and not the y-values, which creates bad leverage points. For comparison, we 
also construct classical contamination settings where we choose whole rows for contam¬ 
ination instead of cells. For these we choose the three contaminations x?””* ~ A/'(50, S), 
^cont ^ 100^ • E) and ~ A/’(50, 10^ ■ E). Additionally, we also want to demon¬ 

strate that the shooting S-algorithm can deal with contamination in the response. From 
the clean data set, we select 1%, 2%, 5% and 10% of observations and generate their 
error terms as econt ~ W(50,(T^) to create vertical outliers. 

To compare the different estimators, we apply them to ii = 1000 generated data sets. 
For each data set, we compute the mean squared error (MSE) 

^ j=l r=l 

Additionally, also the bias or the median squared error could be used as evaluation 
methods. We omit them as they are in line with the MSE. 

The simulation results for cellwise contamination are displayed in Tables [T] and [2] 
for uncorrelated and correlated predictors, respectively. Table [3] gives the results for 
rowwise contamination in the data set with correlated predictors. Table 0] illustrates the 
behavior of the estimators in presence of vertical outliers for correlated predictors. The 
standard errors around the reported results are smaller than 4% of the reported numbers 
in all tables. We omit the results for rowwise contamination and vertical outliers for 
uncorrelated predictors as they are comparable to the ones in the correlated case. 

For uncorrelated predictors, Table [1] demonstrates the need of a new method that can 
deal with cellwise contamination. As well known, LS breaks down for any amount of 
contamination. But also the robust MM- and S-estimator have problems with larger 
amounts of cellwise contamination. As 2% of cellwise contamination corresponds in this 
setting to about 20 — 30% of rowwise contaminatiorU, the ordinary S-estimator already 
breaks down. As we have chosen a breakdown point of 50% for MM, it can deal with 
slightly higher contamination. But for about 5% of cellwise contamination it also breaks 
down. In contrast, the shooting S-estimators can deal with much higher levels of cellwise 

'^The signal-to-noise ratio equals 

^The expected value of the number of contaminated rows is n(l — (1 — e)^) for a cellwise contamination 
level e. 
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Table 1: n ■ MSE of different estimators for cellwise contamination for all three contamination 
settings with n = 100, p = 15 and uncorrelated predictors 




e = 0 

e = 0.01 

e = 0.02 

e = 0.05 

e = 0.1 



LS 

0.30 

23.96 

31.86 

35.97 

36.46 



S 

0.36 

1.08 

12.55 

32.44 

36.36 



MM 

0.33 

0.48 

0.88 

18.35 

34.52 

cont 

j^lj 

~ W(50,l) 

shooting S + BI 

0.43 

0.62 

0.84 

1.72 

5.37 



shooting S -|- skH 

0.55 

0.65 

0.80 

2.02 

5.61 



LS 

0.30 

22.41 

30.82 

36.16 

36.67 



S 

0.36 

0.99 

11.15 

31.21 

36.54 



MM 

0.33 

0.50 

0.99 

16.76 

33.63 

cont ^ 

^ WfO, 100^) 

shooting S + BI 

0.43 

0.62 

0.86 

2.00 

8.94 



shooting S -I- skH 

0.55 

0.66 

0.81 

2.21 

7.48 



LS 

0.30 

23.81 

31.74 

35.97 

36.47 



S 

0.36 

1.08 

12.69 

32.49 

36.37 



MM 

0.33 

0.49 

0.92 

18.50 

34.41 

cont ^ 

^ WfSO, 10^) 

shooting S + BI 

0.43 

0.62 

0.84 

1.76 

5.72 



shooting S -I- skH 

0.55 

0.65 

0.81 

2.05 

5.83 


contamination. They are reliable even for up to 10% of cellwise contamination, or around 
80% of rowwise contamination. The two shooting S-estimators perform comparably in 
this setting. 

Table [5] confirms for correlated predictors what is shown in Table [T] for uncorrel¬ 
ated ones. The only major difference is that for correlated predictors the shooting 
S-estimators already outperform the MM-estimator for 1% of cellwise contamination, 
even though the MM-estimator does not break down yet in this case. 

For rowwise contamination the situation is different (see Table [3|). Here, as known, 
MM and S-estimation give excellent results. The shooting S-estimators give only slightly 
higher values of MSE compared to the ordinary S-estimator, indicating that the shooting 
S-estimators can cope with rowwise contamination as well. Nevertheless, as the shooting 
S-estimator has been developed for cellwise contamination, we do not advise its usage if 
there is only rowwise contamination present. 

The shooting S-estimator can also cope with vertical outliers (see Table SD. It gives 
good results for all levels of contamination used here, although its MSE is slightly higher 
than for the S- and MM-estimators. The reason for the good performance of the shooting 
S-estimator is that the contamination in the response is present in the computation of 
each single coefficient 13j. Robustness of the ‘regression step’ leads to small weights Wij 
for all j. 

We may conclude that the shooting S-estimator is the only considered regression 
estimator that can deal with cellwise contamination above 2% in our simulation setting. 
The estimator also gives good results in presence of vertical outliers. If there are no 
outliers, there is a slight loss in efficiency compared to the other robust estimators. In a 
rowwise contamination setting, we advise the use of the usual S- and MM-estimator. 








Table 2: n ■ MSE of different estimators for cellwise contamination for all three contamination 
settings with n = 100, p = 15 and predictors with correlation matrix E 




e = 0 

e = 0.01 

e = 0.02 

e = 0.05 

e = 0.1 



LS 

1.28 

35.28 

39.47 

35.46 

36.02 



S 

1.53 

6.10 

21.57 

45.55 

40.45 



MM 

1.39 

2.82 

5.58 

26.88 

46.73 

cont 

j^lj 

~ W(50,l) 

shooting S + BI 

1.70 

2.28 

2.84 

3.55 

6.20 



shooting S -|- skH 

2.00 

2.26 

2.44 

3.66 

6.07 



LS 

1.28 

32.66 

38.84 

36.16 

36.54 



S 

1.53 

5.60 

19.17 

43.82 

40.40 



MM 

1.39 

2.93 

5.66 

25.74 

44.23 

cont ^ 

^ WfO, 100^) 

shooting S + BI 

1.70 

2.32 

2.95 

4.25 

10.03 



shooting S -I- skH 

2.00 

2.29 

2.53 

4.04 

7.74 



LS 

1.28 

34.98 

39.24 

35.47 

36.04 



S 

1.53 

6.15 

21.62 

45.52 

40.38 



MM 

1.39 

2.90 

5.55 

27.17 

46.51 

cont ^ 

^ WfSO, 10^) 

shooting S + BI 

1.70 

2.28 

2.86 

3.63 

6.67 



shooting S -I- skH 

2.00 

2.25 

2.45 

3.69 

6.23 


5. Real Data 

We evaluate the performance of the shooting S-estimator on real data sets and compare 
it to the LS, S- and MM-estimators. For all estimators, the tuning parameters are chosen 
as in SectionUl We choose the three data sets Cars93, Auto and Boston. When applying 
the shooting S-estimator, we declare a component of an observation, hence a cell in the 
data matrix, as an outlier if it gets a robustness weight below 0.5. If all components of 
an observation are flagged as outliers, we say that the whole observation is outlying. 

The Cars93 data, a selection of 1993 model cars, are included in the R package MASS. 
Omitting not fully observed data points, we are left with n = 82 observations. We ht the 
following model with p = 14 predictor variables of the Cars93 data (for the definition of 
the variables, see Table El in the appendix) 

PRICE =Po + PiMPG.C + P 2 MPG.H + P 3 ENG.SIZE + P^HP 

+ fl^RPM + PqREV.MILE + IIjFUEL.TANK + /IgLENGTH 
+ PqWHEELBASE + PioWIDTH + PnTURN + P 12 REAR.SEAT 
+ iSisLUGGAGE + EIGHT + error. 

The shooting S-estimator using a biweight loss downweights seven observations as a 
whole and detects outlying cells for another 19 observations. In contrast, the MM- 
estimator downweights the observations corresponding to these outlying cells as a whole, 
thereby loosing information. This information loss is especially visible when looking, for 
example, at observation 46, which receives the weight 0 by the MM-estimator, while the 
shooting S-estimator with biweight loss assigns a weight of about 1 to all components 
except the first component, which receives weight 0. 
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Table 3: n ■ MSE of different estimators for rowwise contamination for all three contamination 
settings with n = 100,p=15 and predictors with correlation matrix E 




e = 0 

e = 0.01 

e = 0.02 

£ = 0.05 

£ = 0.1 


LS 

1.28 

53.62 

53.96 

54.72 

54.80 


S 

1.53 

1.51 

1.50 

1.48 

1.48 

~ W(50, E) 

MM 

1.39 

1.40 

1.41 

1.44 

1.50 


shooting S -1- BI 

1.70 

1.66 

1.63 

1.63 

1.66 


shooting S -I- skH 

2.00 

1.91 

1.88 

1.77 

1.78 


LS 

1.28 

12.58 

22.82 

44.10 

56.41 


S 

1.53 

1.57 

1.58 

1.75 

2.27 

~ A/'(0,100=E) 

MM 

1.39 

1.43 

1.47 

1.58 

1.79 


shooting S -1- BI 

1.70 

1.69 

1.73 

1.98 

3.04 


shooting S -I- skH 

2.00 

1.95 

1.94 

1.98 

2.48 


LS 

1.28 

56.43 

56.28 

55.85 

49.89 


S 

1.53 

1.51 

1.50 

1.48 

1.48 

xr"‘ ~ ^(50, lO’^E) 

MM 

1.39 

1.40 

1.41 

1.44 

1.50 


shooting S -1- BI 

1.70 

1.66 

1.65 

1.67 

1.76 


shooting S -I- skH 

2.00 

1.91 

1.88 

1.79 

1.86 


Table 4: n-MSE of different estimators for vertical outliers with n = 100, p = 15 and predictors 
with correlation matrix E 



£ = 0 

£ = 0.01 

£ = 0.02 

£ = 0.05 

£ = 0.1 

LS 

1.28 

51.40 

97.69 

234.23 

438.65 

S 

1.53 

1.51 

1.50 

1.48 

1.48 

MM 

1.39 

1.40 

1.41 

1.44 

1.50 

shooting S -1- BI 

1.70 

1.73 

1.77 

1.88 

2.13 

shooting S -I- skH 

2.00 

2.03 

2.01 

2.10 

2.14 


see 


T he Auto data set i s included in Stata and can be downloaded from http: //www. stata-press. com/data/ 
StataCoro . 2013l |. It consists of n = 74 fully observed sales of vintage 1978 auto¬ 


mobiles in the United States (see Table [7] in the appendix). We fit the following model 
with p = 8 predictor variables 


PRICE =/3o + (IiMPG + P 2 HEADROOM + P^TRUNK + EIGHT 

+ fl^LENGTH + PgTURN + /lyDIS PLACE + P^GEAR + error. 


The shooting S-estimator with biweight loss downweights five observation as a whole and 
flags cells of another 17 observations as outliers. For instance, observations 12 (‘Chevrolet 
Cavalier’) and 13 (‘Chevrolet Corsica’) receive a weight of zero by MM and the ordinary 
S, while the shooting S-estimator using a biweight loss finds out that only component 2, 
the headroom, is outlying. Again, we conclude that the shooting S-estimator uses more 

information from the data than the MM-estimator or the ordin ary S-estimator. _ 

Th e third data set, the Boston housing data, originates from iHarrison and Rubinfeld 
1978l | and has been extensively analyzed in the robust statistics literature. The data 
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Table 5: Average norm distance {AND) for five estimators computed on three data sets and 
their contaminated versions 



observed data 

contaminated data 


Auto 

Cars93 

Boston 

Auto 

Cars93 

Boston 

LS 

0.388 

0.141 

0.024 

1.320 

0.325 

0.273 

S 

0.459 

0.172 

0.021 

0.697 

0.240 

0.223 

MM 

0.282 

0.213 

0.022 

0.346 

0.243 

0.179 

shooting S -I- BI 

0.607 

0.217 

0.033 

0.251 

0.228 

0.152 

shooting S -I- skH 

0.574 

0.186 

0.039 

0.658 

0.169 

0.138 


are available in the R package mlbench and contain various characteristics of houses, 
demographics, air pollution and geographical details on n = 506 census tracts in and 
nearby Boston. Table [5] in the appendix gives an overview of the definition of the 
variables {p = 9) in the model 


log{MEDV) =/3o + PiCRIM + (32NOX^ + + /3iAGE + log{DIS) 

+ ISgTAX + iSjPTRATIO + PsB + /Jg \og{LSTAT) + error. 


Belslev et al. 198ri| | discovered outlying behavior of census tracts lying in central area of 


Boston, concentrated in three neighborhoods. Applying the shooting S-estimator using 
a biweight loss to the full data set, we get similar results. The shooting S-estimator 
declares the observations from the neighborhoods Back Bay (365—370), Beacon Hill (371- 
373) and South Boston (394-406) as cellwise contaminated, with mainly the components 
corresponding to the variables RM and AGE indicated as outlying. The MM-estimator and 
the ordinary S-estimator downweight as a whole the observations of the neighborhoods 
Back Bay and Beacon Hill and half of the observations of South Boston, resulting in a 
loss of information. 


For each of the three data sets, we randomly choose 4/5th of the observations and 
compute all estimates on this training data set. This we repeat R = 500 times and 

(t) 

we compare the estimates on the training data sets f3 , for r = 1,... ,R, to the one 
computed on the full data set Adjusting for the different scales of the explanatory 

variables, we get what we call the Average Norm Distance (AND) 


AND(;3) 


1 

R 


R 


E 


\ 


jE(T’ 

i=i 


> MAD(yi,...,y„)2 ' 


( 10 ) 


A low value of AND is desired. Table [5] shows the results for all considered estimators 
on the three data sets. As pointed out by a referee, the AND criterion is a version of 
the Jackknife estimator of the variance, and it reflects the efficiency. Therefore, the low 
value of AND for the LS estimator is no surprise. The AND for the S- and MM-estimator 
are close to those of LS, and sometimes even slightly better. The shooting S-estimators 
have somehow larger values of the AND, but the loss in efficiency remains limited. 
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To investigate the robustness of the estimators, we randomly choose 5% of the cells of 
the data set and replace them with ) where jlj and aj denote the 

median and MAD of the jth column of the design matrix, respectively. This we repeat 

C^) 

R = 500 times and we compute the average norm difference as in m, where j3 are 
the estimates from the contaminated data and (3 is the estimate on the original data. 
Table [5] gives the results. Now the AND measures the robustness of the estimators, 
and the LS estimator clearly gives the worst results. The shooting S-estimators, and 
in particular when using a biweight loss, give the best results. They deal better with 
cellwise contamination than the ordinary S- and MM-estimator. 

We did not use a prediction error criterion to assess the performance of the different 
estimators. If the level of cellwise contamination is moderate to high, we expect that 
most observations contain contaminated cells. When forecasting, outlying components 
of an observation get full influence (which is not the case in estimation). Assessing the 
prediction error by cross-validation is then not reliable anymore, since the validation set 
contains too many observations with contaminated components. Using a robust cross- 
validation criterion, as a trimmed mean squared prediction error does not solve this 
problem, as far too many observations used for validation may have outlying components. 
Prediction for cell-wise contaminated observations is left as a topic for future research. 

6. Conclusion 

In this paper, we introduce a regression estimator applicable for cellwise contamination. 
It combines the ideas of ordinary regression S-estimation with the coordinate descent 
algorithm. Thereby the shooting S-estimator is able to use different weights for different 
components of an observation. In our simulations, it can deal with cellwise contamination 
up to 10%. 

Furthermore, the shooting S-estimator can also be used as a diagnostic tool. After 
computation of the shooting S-estimate, the entries of the weight matrix Wij help to 
distinguish between clean data and outliers, and even between cellwise and rowwise 
contamination. While high weights indicate a clean cell, low weights indicate contam¬ 
ination. If all components of the same observation get low weights, this means that all 
components are contaminated or that it is a vertical outlier. 

The efficiency of the shooting S-estimator can be improved by using a shooting MM- 
estimator instead. To obtain a shooting MM-estimator, the simple S-estimation step 
inside the algorithm needs to be replaced with a simple MM-estimation. In order to ex¬ 
plore this idea, the simulations of Section [J] were repeated for a shooting MM-estimator, 
using simple MM-estimation with 20% breakdown point and 95% efficiency at the nor¬ 
mal distribution. Preliminary results indicate that (i) the shooting MM-estimator gave 
generally lower values for mean squared error in the simulations of Section 0] than the 
shooting S-estimator; (ii) especially for clean data and small amounts of contamination, 
the improvement of the shooting MM-estimator over the shooting S-estimator was clearly 
visible; (iii) the shooting MM-estimator outperformed the ordinary MM-estimator in any 
setting where cellwise contamination was present. However, further development of the 
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shooting MM-estimator is necessary and is left for future research. 

Another idea worth considering is the application of an imputation method after 
performing shooting S-regression. Cells that are flagged as outliers can be s et as missing . 
On the data set containing missing values, regression can be performed [see lLittlel . Il992l |. 

Admittedly, our shooting S-estimator has problems with cellwise good leverage points, 
which are observations with large values in some single cells that do follow the regression 
model. The shooting S-estimator tends to flag the contaminated cells of the good leverage 
points as outliers when computing the starting values of the algorithm. However, if the 
data contain rowwise good leverage points, thus large values for all cells of observations 
that do follow the model, the shooting S-estimator behaves comparable to the other 
estimators (LS, S, MM) in our experiments. 

As the shooting S-algorithm deals with each variable separately, it can also be applied 
to data sets with a small sample size and even if n < p. A s uitable /o-function for thi s 
setting may be the linear quadratic quadratic (Iqq) function of iKoller and Stahell 201 ll |. 
as it has been shown to have high efficiency also for small sample sizes. When using 
the Iqq-function in the simulation setups in Section 01 the results are comparable to the 
ones with the other p-functions used there. 

Finally, the shooting S-estimator can be extended to a penalized shooting S-estimator. 
To the simple S-estimation in every variable, a penalty term J{l3) can be added. Possible 
choices for the penalty term are J{/3) = |/3| or J{/3) = /3^. The penalized version of the 
shooting S-estimator could be very useful in high-dimensional settings. 
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A. APPENDIX - Description of Variables for Real Data 
Examples 


Table 6: Variables of the Cars93 data 


Name 

Description 

PRICE 

Midrange Price (in $1,000) 

MPG.C 

City MPG (miles per US gallon by EPA rating) 

MPG.H 

Highway MPG (miles per US gallon by EPA rating) 

ENG.SIZE 

Engine displacement size in liters 

HP 

Maximum horsepower 

RPM 

Revolutions per minute at which maximum horsepower is achieved 

REV.MILE 

Number of revolutions of the engine needed for car to travel one 
mile in its highest gear 

FUEL.TANK 

Capacity of the fuel tank in US gallons 

LENGTH 

Length of the car in inches 

WHEELBASE 

Size of the wheelbase in inches 

WIDTH 

Width of the car in inches 

TURN 

U-turn space in feet 

REAR.se AT 

Rear seat room in inches 

LUGGAGE 

Luggage capacity in cubic feet 

WEIGHT 

Weight of the car in pounds 


Table 7; Variables of the Auto data 


Name 

Description 

PRICE 

MPG 

HEADROOM 

TRUNK 

WEIGHT 

LENGTH 

TURN 

DISPLACE 

GEAR 

Price in US-dollars 

Milage 

Head room in inches 

Trunk space in cubic feet 
Weight of the car in pounds 
Length of the car in inches 
U-turn space in feet 
Displacement in cubic inches 
Gear ratio 
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Table 8: Variables of the Boston data 


Name 

Description 

MEDV 

Median value of owner-occupied homes in USD lOOO’s 

CRIM 

Per capita crime rate by town 

NOX 

Nitric oxides concentration in parts per 10 million 

RM 

Average number of rooms per dwelling 

AGE 

Proportion of owner-occupied units built prior to 1940 

DIS 

Weighted distance to five Boston employment centres 

TAX 

Full-value property-tax rate per USD 10,000 

PTRATIO 

Pupil-Teacher ratio by town 

B 

Proportion of black population 

LSTAT 

Percentage of lower status population 
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